main_dataset |>
mutate(temp_celsius = ifelse(scale == "Fahrenheit", (temp-32) * conversion_factor, temp))Lecture 9:
Exploratory Data Analysis and Visualization, Part I
2024-11-29
Long and wide data. Source: Hugo Tavares
Overview by R4DS:
| dplyr (tidyverse) | base::merge |
|---|---|
inner_join(x, y) |
merge(x, y) |
left_join(x, y) |
merge(x, y, all.x = TRUE) |
right_join(x, y) |
merge(x, y, all.y = TRUE), |
full_join(x, y) |
merge(x, y, all = TRUE) |
Source: Intro to R for Social Scientists
Consider the following data frame:
main_dataset
city temp scale conversion_factor
1 StGallen 12 Celsius 1.0000000
2 Zürich 14 Celsius 1.0000000
3 Detroit 40 Fahrenheit 0.5555556
Select all the statements that are true.
replaces the variable temp with temp_celsius
has 3 rows and 5 columns
returns a tibble containing 2 columns and 1 row
is a good way to get summary statistics about the variable scale
Consider the following code:
Select statements that are true:
inner_join(temperature, temperature_conversions, by="scale") returns a data frame with 3 rows.left_join(temperature, temperature_conversions, by="scale") returns a data frame with 3 rows.full_join(temperature, temperature_conversions, by="scale") returns a data frame with 3 rows.Consider the two following dataframes:
> df_c <- data.frame(id = c(1:3,1:3,5),
+ money_spent= c(1000, 2000, 6000, 1500, 3000, 5500,3000),
+ currency = c("CHF", "CHF", "USD", "EUR", "CHF", "USD", "CAD"),
+ year=c(2017,2017,2017,2018,2018,2018,2018))
> df_c
id money_spent currency year
1 1 1000 CHF 2017
2 2 2000 CHF 2017
3 3 6000 USD 2017
4 1 1500 EUR 2018
5 2 3000 CHF 2018
6 3 5500 USD 2018
7 5 3000 CAD 2018
> df_mtm <- data.frame(
+ id = rep(1:3, 2),
+ year = rep(c(2017,2017,2017,2018,2018,2018), 2),
+ first_name = rep(c("Anna", "Betty", "Claire"), 2),
+ profession = rep(c("Economist", "Data Scientist",
+ "Data Scientist"), 2)
+ )
>
> df_mtm
id year first_name profession
1 1 2017 Anna Economist
2 2 2017 Betty Data Scientist
3 3 2017 Claire Data Scientist
4 1 2018 Anna Economist
5 2 2018 Betty Data Scientist
6 3 2018 Claire Data Scientist
7 1 2017 Anna Economist
8 2 2017 Betty Data Scientist
9 3 2017 Claire Data Scientist
10 1 2018 Anna Economist
11 2 2018 Betty Data Scientist
12 3 2018 Claire Data Scientist
Now consider the following two lines of code:
lj_1 <- left_join(df_mtm, df_c, by = "id")
lj_2 <- left_join(df_mtm, df_c, by = c("id", "year"))
Describe what these two lines of code do. What would be the output? What is the problem with the first line?
Next week (05.12.2024):
“Understanding and reacting to customer dissatisfaction through data: A case study” by Rachel Lund, Deloitte
Exam and LockDown Browser: check Sharepoint on StudentWeb and test on Canvas
Ideas for the last course
Course evaluation: thanks for making this course
a success! 🤓
Source: R4DS, *“Statistics for Public Policy” by Jeremy G. Weber, 2024.
Rsummary()table()skimr, summarytools, janitor (also cleaning)Rmean()swiss data# Load the built-in 'swiss' dataset from the package "datasets"
data("swiss")
swiss <- rownames_to_column(swiss, var = "municipality")
# Add an outlier in Lausanne
swiss[swiss$municipality == "Lausanne", "Infant.Mortality"] <- 100
# Add a NA for Agriculture in Gruyere
swiss[swiss$municipality == "Gruyere", "Agriculture"] <- NA municipality Fertility Agriculture Examination Education Catholic Infant.Mortality
1 Courtelary 80.2 17.0 15 12 9.96 22.2
2 Delemont 83.1 45.1 6 9 84.84 22.2
3 Franches-Mnt 92.5 39.7 5 5 93.40 20.2
4 Moutier 85.8 36.5 12 7 33.77 20.3
5 Neuveville 76.9 43.5 17 15 5.16 20.6
6 Porrentruy 76.1 35.3 9 7 90.57 26.6
7 Broye 83.8 70.2 16 7 92.85 23.6
8 Glane 92.4 67.8 14 8 97.16 24.9
9 Gruyere 82.4 NA 12 7 97.67 21.0
10 Sarine 82.9 45.2 16 13 91.38 24.4
11 Veveyse 87.1 64.5 14 6 98.61 24.5
12 Aigle 64.1 62.0 21 12 8.52 16.5
13 Aubonne 66.9 67.5 14 7 2.27 19.1
14 Avenches 68.9 60.7 19 12 4.43 22.7
15 Cossonay 61.7 69.3 22 5 2.82 18.7
16 Echallens 68.3 72.6 18 2 24.20 21.2
17 Grandson 71.7 34.0 17 8 3.30 20.0
18 Lausanne 55.7 19.4 26 28 12.11 100.0
19 La Vallee 54.3 15.2 31 20 2.15 10.8
20 Lavaux 65.1 73.0 19 9 2.84 20.0
21 Morges 65.5 59.8 22 10 5.23 18.0
22 Moudon 65.0 55.1 14 3 4.52 22.4
23 Nyone 56.6 50.9 22 12 15.14 16.7
24 Orbe 57.4 54.1 20 6 4.20 15.3
25 Oron 72.5 71.2 12 1 2.40 21.0
26 Payerne 74.2 58.1 14 8 5.23 23.8
27 Paysd'enhaut 72.0 63.5 6 3 2.56 18.0
28 Rolle 60.5 60.8 16 10 7.72 16.3
29 Vevey 58.3 26.8 25 19 18.46 20.9
30 Yverdon 65.4 49.5 15 8 6.10 22.5
31 Conthey 75.5 85.9 3 2 99.71 15.1
32 Entremont 69.3 84.9 7 6 99.68 19.8
33 Herens 77.3 89.7 5 2 100.00 18.3
34 Martigwy 70.5 78.2 12 6 98.96 19.4
35 Monthey 79.4 64.9 7 3 98.22 20.2
36 St Maurice 65.0 75.9 9 9 99.06 17.8
37 Sierre 92.2 84.6 3 3 99.46 16.3
38 Sion 79.3 63.1 13 13 96.83 18.1
39 Boudry 70.4 38.4 26 12 5.62 20.3
40 La Chauxdfnd 65.7 7.7 29 11 13.79 20.5
41 Le Locle 72.7 16.7 22 13 11.22 18.9
42 Neuchatel 64.4 17.6 35 32 16.92 23.0
43 Val de Ruz 77.6 37.6 15 7 4.97 20.0
44 ValdeTravers 67.6 18.7 25 7 8.65 19.5
45 V. De Geneve 35.0 1.2 37 53 42.34 18.0
46 Rive Droite 44.7 46.6 16 29 50.43 18.2
47 Rive Gauche 42.8 27.7 22 29 58.33 19.3
swiss data municipality Fertility Agriculture Examination Education
Length:47 Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
Class :character 1st Qu.:64.70 1st Qu.:35.60 1st Qu.:12.00 1st Qu.: 6.00
Mode :character Median :70.40 Median :54.60 Median :16.00 Median : 8.00
Mean :70.14 Mean :50.60 Mean :16.49 Mean :10.98
3rd Qu.:78.45 3rd Qu.:67.72 3rd Qu.:22.00 3rd Qu.:12.00
Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
NA's :1
Catholic Infant.Mortality
Min. : 2.150 Min. : 10.80
1st Qu.: 5.195 1st Qu.: 18.15
Median : 15.140 Median : 20.00
Mean : 41.144 Mean : 21.64
3rd Qu.: 93.125 3rd Qu.: 22.20
Max. :100.000 Max. :100.00
summarytools() to generate an overview (does not render well on slides)Data Frame Summary
swiss
Dimensions: 47 x 7
Duplicates: 0
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | municipality [character] | 1. Aigle 2. Aubonne 3. Avenches 4. Boudry 5. Broye 6. Conthey 7. Cossonay 8. Courtelary 9. Delemont 10. Echallens [ 37 others ] | 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 1 ( 2.1%) 37 (78.7%) | 47 (100.0%) | 0 (0.0%) | |
| 2 | Fertility [numeric] | Mean (sd) : 70.1 (12.5) min < med < max: 35 < 70.4 < 92.5 IQR (CV) : 13.7 (0.2) | 46 distinct values | : . : : : : . . : : : . . : : : : : | 47 (100.0%) | 0 (0.0%) |
| 3 | Agriculture [numeric] | Mean (sd) : 50.6 (23) min < med < max: 1.2 < 54.6 < 89.7 IQR (CV) : 32.1 (0.5) | 46 distinct values | : : : : . . : : : : : : : : : : : : : : : : : : | 46 (97.9%) | 1 (2.1%) |
| 4 | Examination [integer] | Mean (sd) : 16.5 (8) min < med < max: 3 < 16 < 37 IQR (CV) : 10 (0.5) | 22 distinct values | : : : . : : : . : : : : : : : : : : : . | 47 (100.0%) | 0 (0.0%) |
| 5 | Education [integer] | Mean (sd) : 11 (9.6) min < med < max: 1 < 8 < 53 IQR (CV) : 6 (0.9) | 19 distinct values | : : : : : : : . | 47 (100.0%) | 0 (0.0%) |
| 6 | Catholic [numeric] | Mean (sd) : 41.1 (41.7) min < med < max: 2.1 < 15.1 < 100 IQR (CV) : 87.9 (1) | 46 distinct values | : : . : : : : : : . : | 47 (100.0%) | 0 (0.0%) |
| 7 | Infant.Mortality [numeric] | Mean (sd) : 21.6 (12) min < med < max: 10.8 < 20 < 100 IQR (CV) : 4.1 (0.6) | 38 distinct values | : . : : : : : : : : | 47 (100.0%) | 0 (0.0%) |
# A tibble: 2 × 2
`Catholic > 50` `mean(Fertility)`
<lgl> <dbl>
1 FALSE 66.2
2 TRUE 76.5
swiss |>
group_by(Catholic > 50) |>
summarize(across(.cols = c(Fertility, Education),
.fns = list("min" = min, "mean" = mean, "max" = max))) # A tibble: 2 × 7
`Catholic > 50` Fertility_min Fertility_mean Fertility_max Education_min Education_mean
<lgl> <dbl> <dbl> <dbl> <int> <dbl>
1 FALSE 35 66.2 85.8 1 12.1
2 TRUE 42.8 76.5 92.5 2 9.11
# ℹ 1 more variable: Education_max <int>
data <- data.frame(
Gender = c("Male", "Female", "Male", "Female", "Male", "Female"),
Participation = c(1, 0, 1, NA, 0, 1),
Age = c(30, NA, 35, 28, 40, NA)
)
data
# Complete case analysis
complete_case <- na.omit(data)
# Mean imputation for missing Age
mean_impute <- data |>
mutate(Age = ifelse(is.na(Age), mean(Age, na.rm = TRUE), Age)) |>
mutate(Participation = ifelse(is.na(Participation), mean(Participation, na.rm = TRUE), Participation))
# Analyze mean age by gender
data %>%
group_by(Gender) %>%
summarize(
Mean_Age = mean(Age, na.rm = TRUE),
Participation_Rate = mean(Participation, na.rm = TRUE)
)
complete_case %>%
group_by(Gender) %>%
summarize(
Mean_Age = mean(Age, na.rm = TRUE),
Participation_Rate = mean(Participation, na.rm = TRUE)
)
mean_impute %>%
group_by(Gender) %>%
summarize(
Mean_Age = mean(Age, na.rm = TRUE),
Participation_Rate = mean(Participation, na.rm = TRUE)
)na.omit()f(..., na.rm = TRUE) like in mean(), sum(), etc.Use what we just saw in the lecture to solve the following problem 💪. You have the following dataset:
Two ways: display data through tables or graphs.
Depends on the purpose.
# load packages and data
library(tidyverse)
data("swiss")
# compute summary statistics
swiss_summary <- swiss |>
summarise(avg_education = mean(Education),
avg_fertility = mean(Fertility),
N = n()
)
swiss_summary avg_education avg_fertility N
1 10.97872 70.14255 47
Problems?
format()-functionformat() exampletoupper()/tolower().trimws(),gtExtras and sparklines Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Connecticut 3.3 110 77 11.1
Delaware 5.9 238 72 15.8
Florida 15.4 335 80 31.9
Georgia 17.4 211 60 25.8
Showing a raw data frame is not visualization… Problems?
gtExtras and sparklineslibrary(gtExtras)
USArrests_summary <- USArrests |>
mutate(UrbanPop = case_when(UrbanPop > quantile(UrbanPop, .66) ~ "High",
UrbanPop > quantile(UrbanPop, .33) ~ "Middle",
UrbanPop > 0 ~ "Low")) |>
group_by(UrbanPop) |>
summarize(
"Mean murder" = mean(Murder),
"SD murder" = sd(Murder),
Density = list(Murder)
)
USArrests_summary# A tibble: 3 × 4
UrbanPop `Mean murder` `SD murder` Density
<chr> <dbl> <dbl> <list>
1 High 8.07 3.80 <dbl [17]>
2 Low 7.41 5.20 <dbl [17]>
3 Middle 7.89 4.18 <dbl [16]>
gtExtras and sparklinesUSArrests_summary |>
gt() |>
tab_header(
title = md("Murder rates"),
subtitle = md("Per high, middle, and low urban population ")
) |>
gtExtras::gt_plt_dist(Density, type = "density", line_color = "black",
fill_color = "red") %>%
fmt_number(columns = `Mean murder`:`SD murder`, decimals = 2)| Murder rates | |||
|---|---|---|---|
| Per high, middle, and low urban population | |||
| UrbanPop | Mean murder | SD murder | Density |
| High | 8.07 | 3.80 | |
| Low | 7.41 | 5.20 | |
| Middle | 7.89 | 4.18 | |
kable() for html / Markdown reportsstargazer for your LaTeX reports or for your Office Word reportskable()Three main approaches:
graphics package ([@r_2018]; shipped with the base R installation).Three main approaches:
graphics package ([@r_2018]; shipped with the base R installation).lattice package [@lattice_2008], an implementation of the original Bell Labs ‘Trellis’ system.Three main approaches:
graphics package ([@r_2018]; shipped with the base R installation).lattice package [@lattice_2008], an implementation of the original Bell Labs ‘Trellis’ system.ggplot2 package [@wickham_2016], an implementation of Leland Wilkinson’s ‘Grammar of Graphics’.Three main approaches:
graphics package ([@r_2018]; shipped with the base R installation).lattice package [@lattice_2008], an implementation of the original Bell Labs ‘Trellis’ system.ggplot2 package [@wickham_2016], an implementation of Leland Wilkinson’s ‘Grammar of Graphics’.ggplot2 is so good that it has become THE reference [In python, use plotnine to apply the grammar of graphics.]
ggplot2ggplot2 basicsUsing ggplot2 to generate a basic plot in R is quite simple. Basically, it involves three key points:
data.frame/tibble (in tidy format!).ggplot2 basicsUsing ggplot2 to generate a basic plot in R is quite simple. Basically, it involves three key points:
data.frame/tibble (in tidy format!).ggplot().ggplot2 basicsUsing ggplot2 to generate a basic plot in R is quite simple. Basically, it involves three key points:
data.frame/tibble (in tidy format!).ggplot().ggplot2 basicsUsing ggplot2 to generate a basic plot in R is quite simple. Basically, it involves three key points:
data.frame/tibble (in tidy format!).ggplot().swiss Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary 80.2 17.0 15 12 9.96 22.2
Delemont 83.1 45.1 6 9 84.84 22.2
Franches-Mnt 92.5 39.7 5 5 93.40 20.2
Moutier 85.8 36.5 12 7 33.77 20.3
Neuveville 76.9 43.5 17 15 5.16 20.6
Porrentruy 76.1 35.3 9 7 90.57 26.6
Code a province as ‘Catholic’ if more than 50% of the inhabitants are catholic:
ggplotLink: https://rstudio.github.io/cheatsheets/html/data-visualization.html
Look at the graph below. What is wrong with it? Create your own version of the graph.
A Design Problem
Use the following data to create your own version of the graph:
dataChallenge <- data.frame(
Location = rep(c("Bahamas Beach", "French Riviera", "Hawaiian Club"), each = 3),
Fiscal_Year = rep(c("FY93", "FY94", "FY95"), times = 3),
Revenue = c(
200000, 300000, 400000, # Bahamas Beach (FY93, FY94, FY95)
250000, 350000, 500000, # French Riviera (FY93, FY94, FY95)
150000, 450000, 600000 # Hawaiian Club (FY93, FY94, FY95)
)
)